GRIF-716.2: Add workflow_dispatch for promoting bricks stable tag#2081
GRIF-716.2: Add workflow_dispatch for promoting bricks stable tag#2081santos1709 wants to merge 1 commit into
Conversation
bbb2b5d to
400814e
Compare
jiri-staroba
left a comment
There was a problem hiding this comment.
Code review findings — 4 issues (1 critical, 3 moderate).
| dry-run: | ||
| description: 'Dry-run only — print crane commands without executing' | ||
| required: true | ||
| default: false |
There was a problem hiding this comment.
[CRITICAL] Wrong Vault role/path for stable writes
This workflow runs crane tag ${INFRA_REPO_URL}/stable/... (a write into stable/) but requests the staging credentials:
role: ecr-push
secrets: secret/data/v3/int/ecr/infra1-user-ecr-rw ...Every other stable write in this repo uses the stable-tier creds:
role: ecr-ii-push
secrets: secret/data/v2/data-special/infra1-user-ecr-rw ...(see promote-to-stable.yaml and the helm-push action defaults).
The deliberate split (staging → ecr-push/v3, stable → ecr-ii-push/v2) strongly implies these credentials lack push permission on stable/ repos. If so, every crane tag will fail with AccessDenied/403 and no images will be retagged.
Fix: use role: ecr-ii-push + secret/data/v2/data-special/infra1-user-ecr-rw, matching promote-to-stable.yaml.
There was a problem hiding this comment.
actually, the reason I choose secret/data/v3/int... instead of secret/data/v2... because of the following error:
failed to retrieve vault token. code: ERR_NON_2XX_3XX_RESPONSE, message: Response code 400 (Bad Request), vaultResponse: {"errors":["error validating claims: claim \"event_name\" does not match any associated bound claim values"]}
this happened bc the role ecr-ii-push doesn't have permission for workflow_dispatch type. As per https://github.com/gooddata/iac/blob/4ce5aec0c83747b50baa149e2132013a75b2f9c1/vault/vault.intgdc.com/auth/jwt/github/terragrunt.hcl#L526
If we check the promote_to_stable.yaml we can see the type is workflow_call.
We can definitely use ecr-ii-push, but we would need to request to infra guys to add workflow_dispatch for ecr-ii-push.
I tested with ecr-push role and it worked
| @@ -0,0 +1,94 @@ | |||
| name: "LCM: Retag stable image to major version" | |||
| run-name: "Retag stable lcm-bricks ${{ inputs.tag }} → M<major>-<cluster>" | |||
There was a problem hiding this comment.
[MODERATE] dry-run defaults to false — production writes on first click
With default: false, a user who opens Actions → Run workflow, fills in the tag, and clicks Run workflow without touching the dry-run checkbox immediately executes real crane tag against prod stable repos for 8 clusters × 2 images, with no confirmation gate.
Consider defaulting to true so a dry-run is the safe path and the user must explicitly opt into production writes:
default: true| uses: aws-actions/configure-aws-credentials@v4 | ||
| with: | ||
| aws-access-key-id: ${{ env.AWS_ACCESS_KEY }} | ||
| aws-secret-access-key: ${{ env.AWS_SECRET_KEY }} |
There was a problem hiding this comment.
[MODERATE] Hardcoded lcm-bricks-nextversion may not exist for this tag
lcm-bricks-nextversion is only promoted to stable/ when it was built (i.e. VERSION changed in the triggering push and paths-filter included it). If this retag is run for a tag where stable/lcm-bricks-nextversion:TAG was never promoted, crane tag fails because the source manifest is absent.
With set -euo pipefail, the job aborts mid-run — potentially having already retagged all clusters for lcm-bricks but none for lcm-bricks-nextversion, leaving a partially-tagged state.
Consider skipping images whose source tag is absent rather than hard-failing:
if crane manifest "${src}" > /dev/null 2>&1; then
crane tag "${src}" "${major_tag}"
else
echo "[warn] ${src} not found, skipping"
fiThere was a problem hiding this comment.
Considering that this action is a manual triggered actually when a PR is merged - meaning CI has to have passed and consequently all the builds were successfully. I don't see an issue here
| echo "Tagged ${image}:${TAG} → ${image}:${major_tag}" | ||
| fi | ||
| done | ||
| done |
There was a problem hiding this comment.
[CLEANUP] Cluster list and major extraction duplicated between steps
clusters=(na1 ca2 perf1 bom1 syd1 na3 eu1 fra1) and major=$(echo "$TAG" | cut -d. -f1) appear in both the Retag images step and this Summary step. If a cluster is added/removed in one place and missed in the other, the summary silently misreports what was actually tagged.
Consider setting these once as job-level env vars, or sharing via $GITHUB_OUTPUT from the retag step, and reading them here.
There was a problem hiding this comment.
fixed, worked with the same env var for both steps
626dc5b to
a26ff75
Compare
a26ff75 to
1c48a3b
Compare
| url: ${{ secrets.VAULT_ADDRESS }} | ||
| method: jwt | ||
| path: jwt/github | ||
| role: ecr-push |
There was a problem hiding this comment.
[CRITICAL — still not fixed] Wrong Vault role/path for stable/ writes
The Vault credentials are still the staging-tier ones:
role: ecr-push
secrets: secret/data/v3/int/ecr/infra1-user-ecr-rwThis workflow writes to stable/ (via crane tag ${INFRA_REPO_URL}/stable/...), and every other stable/ write in this repo uses:
role: ecr-ii-push
secrets: secret/data/v2/data-special/infra1-user-ecr-rwSee promote-to-stable.yaml (line 32–35) and the .github/actions/container-build-push/action.yaml defaults. The deliberate staging/stable credential split strongly implies ecr-push+v3/int does not have push permission on stable/ repos — this will likely fail with AccessDenied at runtime.
| Run the [LCM: Retag stable image to major version](${{ github.server_url }}/${{ github.repository }}/actions/workflows/lcm-stable-retag.yaml) workflow dispatch: | ||
| - Go to **Actions → LCM: Retag stable image to major version → Run workflow** | ||
| - Set **tag** to \`${{ needs.prepare-build.outputs.service_version }}\` | ||
| - Leave **dry-run** unchecked and click **Run workflow** |
There was a problem hiding this comment.
[NEW] Instruction stale after dry-run default changed to true
Now that dry-run defaults to true, "leave dry-run unchecked" means the workflow runs in dry-run mode — no images will actually be tagged.
GitHub Actions renders boolean inputs as a true/false dropdown, not a checkbox, so "unchecked" is also the wrong UI term.
Suggested fix:
- Set **dry-run** to `false` and click **Run workflow**
The gh CLI example two lines below already passes -f dry-run=false explicitly, so that part is correct.
This PR:
workflow_dispatchfor manually promoting a desired brick lcm tag to the one to be used (Mnwherenis an int > 0)lcm_pipelieneto let the user know about the manual dispatch needed for release once a PR withVersionupdate is merged